feat(compute): add EC2 fleet compute strategy#31

Closed
MichaelWalker-git wants to merge 16 commits into main from feat/compute-strategy

Conversation

@MichaelWalker-git
Contributor

Summary

  • Add EC2 fleet compute strategy with SSM Run Command dispatch — a third compute backend alongside AgentCore (default) and ECS Fargate
  • New Ec2ComputeStrategy handler: finds idle instances via tags, uploads payload to S3, dispatches via SSM AWS-RunShellScript, polls GetCommandInvocation, cancels with CancelCommand
  • New Ec2AgentFleet CDK construct: Auto Scaling Group with launch template (AL2023 ARM64), security group (443 egress only), S3 payload bucket, IAM role with scoped permissions, Docker user data for pre-pulling images
  • Wire orchestrator polling, cancel-task SSM dispatch, and task-api SSM permissions for EC2
  • Stack wiring is commented-out (same pattern as ECS) — ready to enable per-repo via blueprint compute_type: 'ec2'
  • Add instance_type field to RepoConfig and BlueprintConfig for future GPU/custom instance type support
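
For orientation, the handler flow in the first two bullets can be sketched as pure dispatch-planning logic. This is an illustrative sketch, not the repo's actual code: `planDispatch` and the `payloads/` key prefix are assumed names; only the `bgagent:*` tag keys come from the PR itself.

```typescript
// Hypothetical sketch of the EC2 dispatch plan: claim an idle instance via
// tags, upload the payload to S3, then hand shell commands to SSM.
interface Ec2DispatchPlan {
  /** S3 key the orchestrator payload is uploaded under. */
  payloadKey: string;
  /** Tags marking the chosen instance as claimed. */
  busyTags: Record<string, string>;
  /** Shell lines handed to SSM AWS-RunShellScript. */
  commands: string[];
}

function planDispatch(taskId: string, payloadBucket: string): Ec2DispatchPlan {
  const payloadKey = `payloads/${taskId}.json`;
  return {
    payloadKey,
    busyTags: { 'bgagent:status': 'busy', 'bgagent:task-id': taskId },
    commands: [
      `aws s3 cp "s3://${payloadBucket}/${payloadKey}" /tmp/payload.json`,
      'docker run --rm -v /tmp/payload.json:/tmp/payload.json:ro agent:latest',
    ],
  };
}
```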

Test plan

  • mise //cdk:compile — no TypeScript errors
  • mise //cdk:test — 43 suites, 697 tests all passing (including new ec2-strategy and ec2-agent-fleet tests)
  • mise //cdk:synth — synthesizes without errors (EC2 block commented out)
  • mise //cdk:build — full build including lint passes
  • Deploy with EC2 block uncommented and run an end-to-end task with compute_type: 'ec2'

MichaelWalker-git and others added 16 commits April 9, 2026 13:30
…AgentCore logic

Introduce ComputeStrategy interface with SessionHandle/SessionStatus types
and resolveComputeStrategy factory. Extract AgentCoreComputeStrategy from
orchestrator.ts. Refactor orchestrate-task handler to use strategy pattern
for session lifecycle (start/poll/stop). Pure refactor — no behavior change,
identical CloudFormation output.
The mise install step downloads tools (trivy) from GitHub releases.
Without GITHUB_TOKEN, unauthenticated requests hit the 60 req/hr
rate limit, causing flaky CI failures.
Mise uses GITHUB_API_TOKEN (not GITHUB_TOKEN) for authenticated
GitHub API requests when downloading aqua tools like trivy.
Trivy, grype, semgrep, osv-scanner, and gitleaks are only needed for
security scanning tasks, not for the build/test/synth pipeline. Disable
them via MISE_DISABLE_TOOLS to avoid GitHub API rate limits when mise
tries to download them on every PR build.
- Keep gitleaks and osv-scanner enabled in CI build (only disable
  trivy/grype/semgrep which need GitHub API downloads)
- Remove unused @aws-sdk/client-bedrock-agentcore mock from
  orchestrate-task.test.ts (SDK is no longer imported by orchestrator)
- Update PR description to note additive strategy_type event field
1. Single source of truth for runtimeArn — removed constructor param,
   strategy now reads exclusively from blueprintConfig.runtime_arn
2. Lazy singleton for BedrockAgentCoreClient — module-level shared
   client avoids creating new TLS sessions per invocation
3. ComputeType union type ('agentcore' | 'ecs') with exhaustive switch
   and never-pattern in resolveComputeStrategy
4. Differentiated error handling in stopSession — ResourceNotFoundException
   (info), ThrottlingException/AccessDeniedException (error), others (warn)
5. Added logger.info('Session started') after full invoke+transition+event
   sequence in orchestrate-task.ts
6. Added start-session-composition.test.ts with integration tests for
   happy path, error path (failTask), and partial failure (transitionTask throws)
7. pollSession now throws NotImplementedError instead of returning stale
   'running' status — clear signal for future developers
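
Item 3's exhaustive-switch/never pattern, sketched in isolation (the stub return values are placeholders; the real strategies live in their own modules):

```typescript
type ComputeType = 'agentcore' | 'ecs';

interface ComputeStrategy {
  readonly type: ComputeType;
}

function resolveComputeStrategy(computeType: ComputeType): ComputeStrategy {
  switch (computeType) {
    case 'agentcore':
      return { type: 'agentcore' };
    case 'ecs':
      return { type: 'ecs' };
    default: {
      // If a new member is added to ComputeType without a matching case,
      // `computeType` no longer narrows to `never` and this fails to compile.
      const unreachable: never = computeType;
      throw new Error(`Unknown compute type: ${String(unreachable)}`);
    }
  }
}
```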
- Replace require() with ES import for BedrockAgentCoreClient mock
- Fix import ordering in start-session-composition test
Wire ECS Fargate as a compute backend behind the existing ComputeStrategy
interface, using the existing durable Lambda orchestrator. No separate
stacks or Step Functions — ECS is a strategy option alongside AgentCore.

Changes:
- EcsComputeStrategy: startSession (RunTask), pollSession (DescribeTasks
  state mapping), stopSession (StopTask with graceful error handling)
- EcsAgentCluster construct: ECS Cluster (container insights), Fargate
  task def (2 vCPU/4GB/ARM64), security group (TCP 443 egress only),
  CloudWatch log group, task role (DynamoDB, SecretsManager, Bedrock)
- TaskOrchestrator: optional ECS props for env vars and IAM policies
  (ecs:RunTask/DescribeTasks/StopTask conditioned on cluster ARN,
  iam:PassRole conditioned on ecs-tasks.amazonaws.com)
- Orchestrator polling: ECS compute-level crash detection alongside
  existing DDB polling (non-fatal, wrapped in try/catch)
- AgentStack: conditional ECS infrastructure (ABCA_ENABLE_ECS env var)
- Full test coverage: 15 ECS strategy tests, 9 construct tests,
  5 orchestrator ECS tests. All 563 tests pass.

Deployed and verified: stack deploys cleanly, CDK synth passes cdk-nag,
agent task running on AgentCore path unaffected.
- Keep gitleaks/osv-scanner enabled in CI (only disable trivy/grype/semgrep)
- Type ComputeStrategy.type and SessionHandle.strategyType as ComputeType
- Trim/filter ECS_SUBNETS to handle whitespace and trailing commas
- Handle undefined exit code in ECS pollSession (container never started)
- Scope iam:PassRole to specific ECS task/execution role ARNs
- Validate all-or-nothing ECS props in TaskOrchestrator constructor
- Remove dead hasEcsBlueprint detection; document env-flag driven approach
- Add comment noting strategy_type as additive event field
The ECS container's default CMD starts uvicorn server:app which waits
for HTTP POST to /invocations — but in standalone ECS nobody sends that
request, leaving the agent idle. Override the container command to invoke
entrypoint.run_task() directly with the full orchestrator payload via
AGENT_PAYLOAD env var. Also add GITHUB_TOKEN_SECRET_ARN to the ECS task
definition base environment.
Add a third compute backend (EC2 fleet with SSM Run Command) alongside
the existing AgentCore and ECS strategies. This provides maximum
flexibility with no image size limits, configurable instance types
(including GPU), and full control over the compute environment.

New files:
- ec2-strategy.ts: ComputeStrategy implementation using EC2 tags for
  instance tracking and SSM RunShellScript for task dispatch
- ec2-agent-fleet.ts: CDK construct with ASG, launch template,
  security group, S3 payload bucket, and IAM role
- ec2-strategy.test.ts and ec2-agent-fleet.test.ts: full test coverage

Wiring:
- repo-config.ts: add 'ec2' to ComputeType, add instance_type field
- compute-strategy.ts: add EC2 SessionHandle variant and resolver case
- task-orchestrator.ts: add ec2Config prop with env vars and IAM grants
- orchestrate-task.ts: enable compute polling for EC2
- cancel-task.ts: add SSM CancelCommand for EC2 tasks
- task-api.ts: add ssm:CancelCommand permission for cancel Lambda
- agent.ts: add commented-out EC2 fleet block (same pattern as ECS)
@MichaelWalker-git
Contributor Author

Recreating from a clean branch off main to avoid conflicts from prior commits


Copilot AI left a comment


Pull request overview

This PR adds a third compute backend (“EC2 fleet”) alongside the existing AgentCore and ECS options, wiring strategy selection into the orchestrator and extending CDK constructs/tests/docs to support the new backend.

Changes:

  • Introduces compute-strategy abstraction with implementations for AgentCore, ECS Fargate, and EC2 fleet (SSM Run Command + S3 payload).
  • Updates orchestrator start-session and polling to use the selected compute strategy and persist compute metadata for cancellation.
  • Adds CDK constructs/tests for ECS agent cluster and EC2 agent fleet, plus small docs/CI updates.

Reviewed changes

Copilot reviewed 31 out of 33 changed files in this pull request and generated 9 comments.

File Description
yarn.lock Adds AWS SDK clients (EC2/ECS/S3/SSM) and transitive deps.
docs/src/content/docs/design/Architecture.md Adds rationale section on separating orchestrator vs agent loops.
docs/design/ARCHITECTURE.md Same rationale section mirrored into top-level design doc.
cdk/test/handlers/start-session-composition.test.ts Integration-style orchestration step composition tests.
cdk/test/handlers/shared/strategies/agentcore-strategy.test.ts Unit tests for AgentCore compute strategy.
cdk/test/handlers/shared/strategies/ecs-strategy.test.ts Unit tests for ECS compute strategy.
cdk/test/handlers/shared/strategies/ec2-strategy.test.ts Unit tests for EC2 compute strategy.
cdk/test/handlers/shared/preflight.test.ts Normalizes compute_type casing in tests.
cdk/test/handlers/shared/compute-strategy.test.ts Tests strategy resolution for agentcore/ecs/ec2.
cdk/test/handlers/orchestrate-task.test.ts Removes older startSession tests now handled by strategies/composition tests.
cdk/test/handlers/cancel-task.test.ts Adds ECS cancellation coverage and behavior tests.
cdk/test/constructs/task-orchestrator.test.ts Adds ECS env var + IAM wiring tests.
cdk/test/constructs/task-api.test.ts Adds cancel-task ECS env var + IAM wiring tests.
cdk/test/constructs/ecs-agent-cluster.test.ts New tests for ECS cluster construct.
cdk/test/constructs/ec2-agent-fleet.test.ts New tests for EC2 fleet construct.
cdk/src/stacks/agent.ts Adds commented wiring blocks for ECS/EC2 backends.
cdk/src/handlers/shared/types.ts Persists compute_type + compute_metadata on task records.
cdk/src/handlers/shared/strategies/agentcore-strategy.ts New AgentCore compute strategy implementation.
cdk/src/handlers/shared/strategies/ecs-strategy.ts New ECS compute strategy implementation.
cdk/src/handlers/shared/strategies/ec2-strategy.ts New EC2 compute strategy implementation.
cdk/src/handlers/shared/repo-config.ts Adds ComputeType union + instance_type config field.
cdk/src/handlers/shared/orchestrator.ts Adds PollState fields + instance_type wiring; removes old startSession helper.
cdk/src/handlers/shared/compute-strategy.ts New strategy interface + resolver.
cdk/src/handlers/orchestrate-task.ts Uses compute strategies for start + compute-level polling.
cdk/src/handlers/cancel-task.ts Adds ECS StopTask + EC2 SSM CancelCommand cancellation paths.
cdk/src/constructs/task-orchestrator.ts Adds optional ECS/EC2 config env vars + IAM grants.
cdk/src/constructs/task-api.ts Adds optional ECS/EC2 cancellation wiring + IAM grants.
cdk/src/constructs/ecs-agent-cluster.ts New ECS cluster construct (Fargate task def + SG + IAM).
cdk/src/constructs/ec2-agent-fleet.ts New EC2 ASG-based fleet construct (SSM-managed instances).
cdk/src/constructs/blueprint.ts Extends blueprint compute type to include ec2.
cdk/package.json Adds AWS SDK clients needed for new strategies.
.gitignore Ignores local-docs directory.
.github/workflows/build.yml Sets GitHub token env vars for CI tools; adjusts MISE_DISABLE_TOOLS.


Comment on lines +99 to +108
// 3. Tag instance as busy
await getEc2Client().send(new CreateTagsCommand({
  Resources: [instanceId],
  Tags: [
    { Key: 'bgagent:status', Value: 'busy' },
    { Key: 'bgagent:task-id', Value: taskId },
  ],
}));

// 4. Build the boot command (mirrors ECS strategy env vars and Python boot command)

Copilot AI Apr 14, 2026


The instance is tagged bgagent:status=busy before the SSM command is dispatched, but if SendCommand throws or returns no CommandId, the instance will remain stuck in busy (and with bgagent:task-id set). Wrap the dispatch in a try/catch/finally that reverts tags on failure, or tag busy only after a successful SendCommand response.

Suggested change — drop the early tagging block and renumber the following comment:

// 3. Build the boot command (mirrors ECS strategy env vars and Python boot command)
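The revert pattern this review comment asks for can be sketched with the AWS calls injected as plain functions. `dispatchWithTagRollback` is an illustrative name, not the repo's code; the real handler would wrap the SDK's `SendCommand` and `CreateTags` calls.

```typescript
type SendCommandFn = () => Promise<string | undefined>; // resolves to CommandId
type TagInstanceFn = (status: 'busy' | 'idle') => Promise<void>;

async function dispatchWithTagRollback(
  sendCommand: SendCommandFn,
  tagInstance: TagInstanceFn,
): Promise<string> {
  await tagInstance('busy');
  try {
    const commandId = await sendCommand();
    if (!commandId) throw new Error('SendCommand returned no CommandId');
    return commandId;
  } catch (err) {
    // Revert the claim so the instance returns to the idle pool.
    await tagInstance('idle').catch(() => undefined);
    throw err;
  }
}
```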
Comment on lines +56 to +76
// The ECS container's default CMD starts the FastAPI server (uvicorn) which
// waits for HTTP POST to /invocations — but in standalone ECS nobody sends
// that request. We override the container command to invoke run_task()
// directly with the full orchestrator payload (including hydrated_context).
// This avoids the server entirely and runs the agent in batch mode.
const payloadJson = JSON.stringify(payload);

const containerEnv = [
  { name: 'TASK_ID', value: taskId },
  { name: 'REPO_URL', value: String(payload.repo_url ?? '') },
  ...(payload.prompt ? [{ name: 'TASK_DESCRIPTION', value: String(payload.prompt) }] : []),
  ...(payload.issue_number ? [{ name: 'ISSUE_NUMBER', value: String(payload.issue_number) }] : []),
  { name: 'MAX_TURNS', value: String(payload.max_turns ?? 100) },
  ...(payload.max_budget_usd !== undefined ? [{ name: 'MAX_BUDGET_USD', value: String(payload.max_budget_usd) }] : []),
  ...(blueprintConfig.model_id ? [{ name: 'ANTHROPIC_MODEL', value: blueprintConfig.model_id }] : []),
  ...(blueprintConfig.system_prompt_overrides ? [{ name: 'SYSTEM_PROMPT_OVERRIDES', value: blueprintConfig.system_prompt_overrides }] : []),
  { name: 'CLAUDE_CODE_USE_BEDROCK', value: '1' },
  // Full orchestrator payload as JSON — the Python wrapper reads this to
  // call run_task() with all fields including hydrated_context.
  { name: 'AGENT_PAYLOAD', value: payloadJson },
  ...(payload.github_token_secret_arn

Copilot AI Apr 14, 2026


This strategy serializes the full orchestrator payload (including hydrated_context) into an ECS environment variable (AGENT_PAYLOAD). ECS task overrides have fairly small limits on environment size, so larger contexts can cause RunTask to fail at runtime. Consider switching to an S3-backed payload (pass an S3 URI/key in env) or another mechanism that doesn't depend on env-var size limits.
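
One way to cut over, sketched under assumed names: inline small payloads, hand off larger ones via an S3 pointer. `AGENT_PAYLOAD_S3_URI` and the threshold are illustrative, not the repo's API; the ECS RunTask overrides structure is capped at roughly 8 KiB, so the limit leaves headroom.

```typescript
// Conservative inline limit well under the RunTask overrides cap.
const MAX_INLINE_PAYLOAD_BYTES = 4096;

function buildPayloadEnv(
  payloadJson: string,
  payloadS3Uri: string,
): { name: string; value: string } {
  if (new TextEncoder().encode(payloadJson).length <= MAX_INLINE_PAYLOAD_BYTES) {
    return { name: 'AGENT_PAYLOAD', value: payloadJson };
  }
  // The container fetches the payload itself; env carries only a pointer.
  return { name: 'AGENT_PAYLOAD_S3_URI', value: payloadS3Uri };
}
```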

Comment on lines +145 to +168
} else if (computeType === 'ec2') {
  // EC2-backed task — cancel the SSM command
  const commandId = record.compute_metadata?.commandId;
  const instanceId = record.compute_metadata?.instanceId;
  if (commandId) {
    try {
      await ssmClient.send(new CancelCommandCommand({
        CommandId: commandId,
        ...(instanceId && { InstanceIds: [instanceId] }),
      }));
      logger.info('SSM CancelCommand invoked after cancel', { task_id: taskId, command_id: commandId, request_id: requestId });
    } catch (stopErr) {
      logger.warn('SSM CancelCommand failed after cancel (command may already be done)', {
        task_id: taskId,
        request_id: requestId,
        error: stopErr instanceof Error ? stopErr.message : String(stopErr),
      });
    }
  } else {
    logger.warn('EC2 task cancel skipped: missing commandId in compute_metadata', {
      task_id: taskId,
      request_id: requestId,
    });
  }

Copilot AI Apr 14, 2026


For compute_type: 'ec2', cancel currently only calls SSM CancelCommand. If the command is cancelled mid-script, the cleanup/tag-reset section in the boot script may never run, leaving the instance stuck in bgagent:status=busy and effectively reducing fleet capacity. Consider also re-tagging the instance back to idle here (and deleting bgagent:task-id), or invoking the EC2 strategy’s stopSession logic from the cancel handler.
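
A sketch of that cleanup, with the SSM/EC2 calls injected so the ordering is visible. `cancelEc2Task` and `CancelDeps` are illustrative names; the real handler would call the SDK clients directly.

```typescript
interface CancelDeps {
  cancelCommand: (commandId: string) => Promise<void>;
  setInstanceTags: (instanceId: string, tags: Record<string, string>) => Promise<void>;
}

async function cancelEc2Task(
  deps: CancelDeps,
  commandId: string,
  instanceId?: string,
): Promise<void> {
  // Best-effort cancel — the command may already have finished.
  await deps.cancelCommand(commandId).catch(() => undefined);
  if (instanceId) {
    // Re-tag idle regardless of cancel outcome so the instance rejoins the
    // fleet even when the boot script's own cleanup section never ran.
    await deps.setInstanceTags(instanceId, {
      'bgagent:status': 'idle',
      'bgagent:task-id': '',
    });
  }
}
```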

Comment on lines +179 to +235
let consecutiveEcsPollFailures = 0;
let consecutiveEcsCompletedPolls = 0;

// ECS compute-level crash detection: if DDB is not terminal, check ECS task status
if (
  ddbState.lastStatus &&
  !TERMINAL_STATUSES.includes(ddbState.lastStatus) &&
  computeStrategy
) {
  try {
    const ecsStatus = await computeStrategy.pollSession(sessionHandle);
    if (ecsStatus.status === 'failed') {
      const errorMsg = 'error' in ecsStatus ? ecsStatus.error : 'ECS task failed';
      logger.warn('ECS task failed before DDB terminal write', {
        task_id: taskId,
        error: errorMsg,
      });
      await failTask(taskId, ddbState.lastStatus, `ECS container failed: ${errorMsg}`, task.user_id, true);
      return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
    }
    if (ecsStatus.status === 'completed') {
      consecutiveEcsCompletedPolls = (state.consecutiveEcsCompletedPolls ?? 0) + 1;
      if (consecutiveEcsCompletedPolls >= MAX_CONSECUTIVE_ECS_COMPLETED_POLLS) {
        // ECS task exited successfully but DDB never reached terminal — the agent
        // likely crashed after container exit code 0 but before writing status.
        logger.error('ECS task completed but DDB never caught up — failing task', {
          task_id: taskId,
          consecutive_completed_polls: consecutiveEcsCompletedPolls,
        });
        await failTask(taskId, ddbState.lastStatus, `ECS task exited successfully but agent never wrote terminal status after ${consecutiveEcsCompletedPolls} polls`, task.user_id, true);
        return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
      }
      logger.warn('ECS task completed but DDB not terminal — waiting for DDB catchup', {
        task_id: taskId,
        consecutive_completed_polls: consecutiveEcsCompletedPolls,
      });
    }
  } catch (err) {
    consecutiveEcsPollFailures = (state.consecutiveEcsPollFailures ?? 0) + 1;
    if (consecutiveEcsPollFailures >= MAX_CONSECUTIVE_ECS_POLL_FAILURES) {
      logger.error('ECS pollSession failed repeatedly — failing task', {
        task_id: taskId,
        consecutive_failures: consecutiveEcsPollFailures,
        error: err instanceof Error ? err.message : String(err),
      });
      await failTask(taskId, ddbState.lastStatus, `ECS poll failed ${consecutiveEcsPollFailures} consecutive times: ${err instanceof Error ? err.message : String(err)}`, task.user_id, true);
      return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
    }
    logger.warn('ECS pollSession check failed (non-fatal)', {
      task_id: taskId,
      consecutive_failures: consecutiveEcsPollFailures,
      error: err instanceof Error ? err.message : String(err),
    });
  }
}

return { ...ddbState, consecutiveEcsPollFailures, consecutiveEcsCompletedPolls };

Copilot AI Apr 14, 2026


The compute-level polling block is gated on blueprintConfig.compute_type === 'ecs' || 'ec2', but the variables/log messages/errors are all ECS-specific (e.g., consecutiveEcs*, "ECS container failed"). This will produce misleading failures for EC2 and makes the logic harder to extend. Consider renaming this to backend-neutral terminology and branching error messages based on sessionHandle.strategyType (or splitting ECS vs EC2 polling).

Suggested change — the quoted block above is replaced wholesale with backend-neutral naming:

let consecutiveComputePollFailures = 0;
let consecutiveComputeCompletedPolls = 0;
const computeBackendLabel = sessionHandle.strategyType === 'ec2' ? 'EC2' : 'ECS';

// Compute-level crash detection: if DDB is not terminal, check compute session status.
if (
  ddbState.lastStatus &&
  !TERMINAL_STATUSES.includes(ddbState.lastStatus) &&
  computeStrategy
) {
  try {
    const computeStatus = await computeStrategy.pollSession(sessionHandle);
    if (computeStatus.status === 'failed') {
      const errorMsg =
        'error' in computeStatus ? computeStatus.error : `${computeBackendLabel} task failed`;
      logger.warn(`${computeBackendLabel} task failed before DDB terminal write`, {
        task_id: taskId,
        error: errorMsg,
      });
      await failTask(
        taskId,
        ddbState.lastStatus,
        `${computeBackendLabel} compute failed: ${errorMsg}`,
        task.user_id,
        true,
      );
      return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
    }
    if (computeStatus.status === 'completed') {
      consecutiveComputeCompletedPolls = (state.consecutiveEcsCompletedPolls ?? 0) + 1;
      if (consecutiveComputeCompletedPolls >= MAX_CONSECUTIVE_ECS_COMPLETED_POLLS) {
        // Compute session exited successfully but DDB never reached terminal —
        // the agent likely crashed after compute completion but before writing status.
        logger.error(`${computeBackendLabel} task completed but DDB never caught up — failing task`, {
          task_id: taskId,
          consecutive_completed_polls: consecutiveComputeCompletedPolls,
        });
        await failTask(
          taskId,
          ddbState.lastStatus,
          `${computeBackendLabel} task exited successfully but agent never wrote terminal status after ${consecutiveComputeCompletedPolls} polls`,
          task.user_id,
          true,
        );
        return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
      }
      logger.warn(`${computeBackendLabel} task completed but DDB not terminal — waiting for DDB catchup`, {
        task_id: taskId,
        consecutive_completed_polls: consecutiveComputeCompletedPolls,
      });
    }
  } catch (err) {
    consecutiveComputePollFailures = (state.consecutiveEcsPollFailures ?? 0) + 1;
    if (consecutiveComputePollFailures >= MAX_CONSECUTIVE_ECS_POLL_FAILURES) {
      logger.error(`${computeBackendLabel} pollSession failed repeatedly — failing task`, {
        task_id: taskId,
        consecutive_failures: consecutiveComputePollFailures,
        error: err instanceof Error ? err.message : String(err),
      });
      await failTask(
        taskId,
        ddbState.lastStatus,
        `${computeBackendLabel} poll failed ${consecutiveComputePollFailures} consecutive times: ${err instanceof Error ? err.message : String(err)}`,
        task.user_id,
        true,
      );
      return { attempts: ddbState.attempts, lastStatus: TaskStatus.FAILED };
    }
    logger.warn(`${computeBackendLabel} pollSession check failed (non-fatal)`, {
      task_id: taskId,
      consecutive_failures: consecutiveComputePollFailures,
      error: err instanceof Error ? err.message : String(err),
    });
  }
}

return {
  ...ddbState,
  consecutiveEcsPollFailures: consecutiveComputePollFailures,
  consecutiveEcsCompletedPolls: consecutiveComputeCompletedPolls,
};
Comment on lines +283 to +315
// EC2 fleet compute strategy permissions (only when EC2 is configured)
if (props.ec2Config) {
  this.fn.addToRolePolicy(new iam.PolicyStatement({
    actions: [
      'ec2:DescribeInstances',
      'ec2:CreateTags',
    ],
    resources: ['*'],
  }));

  this.fn.addToRolePolicy(new iam.PolicyStatement({
    actions: [
      'ssm:SendCommand',
      'ssm:GetCommandInvocation',
      'ssm:CancelCommand',
    ],
    resources: ['*'],
  }));

  this.fn.addToRolePolicy(new iam.PolicyStatement({
    actions: ['s3:PutObject'],
    resources: [`arn:${Aws.PARTITION}:s3:::${props.ec2Config.payloadBucketName}/*`],
  }));

  this.fn.addToRolePolicy(new iam.PolicyStatement({
    actions: ['iam:PassRole'],
    resources: [props.ec2Config.instanceRoleArn],
    conditions: {
      StringEquals: {
        'iam:PassedToService': 'ec2.amazonaws.com',
      },
    },
  }));

Copilot AI Apr 14, 2026


The EC2 orchestrator IAM statements are very broad (ec2:CreateTags on *, ssm:* on *) and the iam:PassRole grant conditioned on ec2.amazonaws.com doesn’t appear to be used by the EC2 strategy (it doesn’t call APIs that pass a role). Consider removing the EC2 iam:PassRole statement, and scoping EC2 tagging permissions with conditions (e.g., ec2:ResourceTag/${fleetTagKey} == fleetTagValue and limiting aws:TagKeys) to reduce blast radius.

Copilot uses AI. Check for mistakes.
Comment on lines +113 to +116
 */
readonly ec2FleetConfig?: {
  readonly instanceRoleArn: string;
};

Copilot AI Apr 14, 2026


ec2FleetConfig includes instanceRoleArn, but that value isn’t used (the construct only conditionally grants ssm:CancelCommand on *). This makes the API misleading and suggests a scoping that doesn’t exist. Either remove instanceRoleArn from the props or use it to implement a more scoped permission model (e.g., document/service role, or additional permissions needed to reset instance tags on cancel).

Suggested change — drop the unused field and document the prop as a bare enablement flag:

 * This object currently serves only as an enablement flag for EC2-backed cancellation.
 */
readonly ec2FleetConfig?: {};

// Tag the ASG instances for fleet identification
// CDK auto-propagates tags from the ASG to instances
this.autoScalingGroup.node.defaultChild;

Copilot AI Apr 14, 2026


this.autoScalingGroup.node.defaultChild; is a no-op statement and can be removed. If the intent was to access/modify the underlying CFN resource, assign it to a variable and use it explicitly (otherwise this line adds noise without effect).

Suggested change — delete the no-op line:

this.autoScalingGroup.node.defaultChild;

'',
'# Set environment variables',
...envExports,
'',

Copilot AI Apr 14, 2026


The EC2 boot script uses aws ecr get-login-password --region $AWS_REGION, but AWS_REGION is never set in the script. On stock AL2023 instances this is typically unset, which will make ECR login/pull fail. Consider deriving the region from instance metadata (like the user-data does) and exporting AWS_REGION/AWS_DEFAULT_REGION before the AWS CLI and docker run steps.

Suggested change — resolve the region from instance metadata before the env exports:

'',
'# Resolve AWS region from instance metadata',
'IMDS_TOKEN=$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")',
'AWS_REGION=$(curl -sS -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document | sed -n \'s/.*"region"[[:space:]]*:[[:space:]]*"\\([^"]*\\)".*/\\1/p\')',
'export AWS_REGION',
'export AWS_DEFAULT_REGION="$AWS_REGION"',
'',
Comment on lines +127 to +130
'# Fetch payload from S3',
`aws s3 cp "s3://${EC2_PAYLOAD_BUCKET}/${payloadKey}" /tmp/payload.json`,
'export AGENT_PAYLOAD=$(cat /tmp/payload.json)',
'',

Copilot AI Apr 14, 2026


export AGENT_PAYLOAD=$(cat /tmp/payload.json) is unsafe: command substitution strips newlines and can mangle JSON, and large hydrated payloads may exceed shell/env-var limits. Prefer keeping the payload as a file (e.g., mount /tmp/payload.json into the container and have the Python entrypoint read it) or at least quote the assignment to preserve content.
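
The file-based handoff this comment suggests, sketched from the container side. `AGENT_PAYLOAD_FILE` is an assumed variable name, and the real entrypoint is Python; TypeScript is used here for consistency with the rest of the PR.

```typescript
import { readFileSync } from 'node:fs';

function loadAgentPayload(env: Record<string, string | undefined>): unknown {
  const file = env.AGENT_PAYLOAD_FILE;
  if (file) {
    // A mounted file round-trips arbitrary JSON with no shell quoting,
    // newline stripping, or env-var size limits.
    return JSON.parse(readFileSync(file, 'utf8'));
  }
  if (env.AGENT_PAYLOAD) {
    return JSON.parse(env.AGENT_PAYLOAD); // legacy inline path
  }
  throw new Error('No agent payload provided');
}
```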
